Safe Policies for Reinforcement Learning via Primal-Dual Methods
Authors
Abstract
In this article, we study the design of controllers in the context of stochastic optimal control under the assumption that a model of the system is not available. That is, we aim to control a Markov decision process whose transition probabilities we do not know, but for which we have access to sample trajectories through experience. We define safety as the agent remaining in a desired safe set with high probability during the operation time. The drawbacks of this formulation are twofold: the problem is nonconvex, and computing the gradients of the constraints with respect to the policies is prohibitive. Hence, we propose an ergodic relaxation of the problem with the following advantages. 1) The safety guarantees are maintained in the case of episodic tasks, and they hold until a given time horizon for continuing tasks. 2) The constrained optimization, despite its nonconvexity, has an arbitrarily small duality gap if the parametrization of the controller is rich enough. 3) The gradients of the Lagrangian associated with the learning problem can be computed using standard reinforcement learning results and stochastic approximation tools. Leveraging these advantages, we exploit primal-dual algorithms to find policies that are safe and optimal. We test the proposed approach in a navigation task in a continuous domain. The numerical results show that our algorithm is capable of dynamically adapting the policy to the environment and the required levels of safety.
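The abstract describes a primal-dual scheme: ascend the Lagrangian of the relaxed safe-learning problem in the policy parameters while descending in the multiplier of the safety constraint. The sketch below illustrates that alternation on a hypothetical toy chain MDP with a tabular softmax policy and REINFORCE gradient estimates; the environment, helper names, step sizes, and safety level are our own illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Toy 1-D chain MDP (illustrative only): state 0 is unsafe, state 4 is rewarding.
# The constraint asks that a trajectory avoid state 0 with probability >= delta.
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, HORIZON = 5, 2, 10

def step(s, a):
    s_next = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    reward = 1.0 if s_next == N_STATES - 1 else 0.0
    safe = s_next != 0                      # per-step safety indicator
    return s_next, reward, safe

def policy(theta, s):
    logits = theta[s]                       # tabular softmax policy
    p = np.exp(logits - logits.max())
    return p / p.sum()

def rollout(theta):
    s, states, actions, ret, safe_traj = 2, [], [], 0.0, 1.0
    for _ in range(HORIZON):
        p = policy(theta, s)
        a = rng.choice(N_ACTIONS, p=p)
        states.append(s); actions.append(a)
        s, r, safe = step(s, a)
        ret += r
        safe_traj *= float(safe)            # 1.0 only if the whole trajectory stays safe
    return states, actions, ret, safe_traj

def reinforce_grad(theta, states, actions, signal):
    """Score-function (REINFORCE) estimate of the gradient of E[signal]."""
    g = np.zeros_like(theta)
    for s, a in zip(states, actions):
        p = policy(theta, s)
        g[s] -= p
        g[s, a] += 1.0
    return g * signal

# Primal-dual loop: gradient ascent on the Lagrangian
#   L(theta, lam) = E[return] + lam * (P(trajectory safe) - delta)
# in theta, projected gradient descent in the multiplier lam >= 0.
theta, lam, delta = np.zeros((N_STATES, N_ACTIONS)), 0.0, 0.9
for _ in range(2000):
    states, actions, ret, safe_traj = rollout(theta)
    theta += 0.05 * reinforce_grad(theta, states, actions, ret + lam * safe_traj)
    lam = max(0.0, lam - 0.05 * (safe_traj - delta))   # projected dual step
print("final multiplier:", lam)
```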
Similar resources
Accelerated Primal-Dual Policy Optimization for Safe Reinforcement Learning
Constrained Markov Decision Process (CMDP) is a natural framework for reinforcement learning tasks with safety constraints, where agents learn a policy that maximizes the long-term reward while satisfying the constraints on the long-term cost. A canonical approach for solving CMDPs is the primal-dual method which updates parameters in primal and dual spaces in turn. Existing methods for CMDPs o...
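For reference, the alternation in primal and dual spaces that this snippet mentions typically takes the following form; the notation (reward value V_r, cost value V_c, cost budget b, step sizes eta) is generic and not taken from the cited paper:

$$\mathcal{L}(\theta,\lambda) = V_r(\theta) - \lambda^{\top}\bigl(V_c(\theta) - b\bigr), \qquad \theta_{k+1} = \theta_k + \eta_\theta \nabla_\theta \mathcal{L}(\theta_k,\lambda_k), \qquad \lambda_{k+1} = \bigl[\lambda_k + \eta_\lambda \bigl(V_c(\theta_k) - b\bigr)\bigr]_{+}.$$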
Stochastic Primal-Dual Methods and Sample Complexity of Reinforcement Learning
We study the online estimation of the optimal policy of a Markov decision process (MDP). We propose a class of Stochastic Primal-Dual (SPD) methods which exploit the inherent minimax duality of Bellman equations. The SPD methods update a few coordinates of the value and policy estimates as a new state transition is observed. These methods use small storage and have low computational complexity p...
Safe Reinforcement Learning via Shielding
Reinforcement learning algorithms discover policies that maximize reward, but do not necessarily guarantee safety during learning or execution phases. We introduce a new approach to learn optimal policies while enforcing properties expressed in temporal logic. To this end, given the temporal logic specification that is to be obeyed by the learning system, we propose to synthesize a reactive sys...
Safe Reinforcement Learning via Formal Methods: Toward Safe Control Through Proof and Learning
Formal verification provides a high degree of confidence in safe system operation, but only if reality matches the verified model. Although a good model will be accurate most of the time, even the best models are incomplete. This is especially true in Cyber-Physical Systems because high-fidelity physical models of systems are expensive to develop and often intractable to verify. Conversely, rei...
Safe exploration for reinforcement learning
In this paper we define and address the problem of safe exploration in the context of reinforcement learning. Our notion of safety is concerned with states or transitions that can lead to damage and thus must be avoided. We introduce the concepts of a safety function for determining a state’s safety degree and that of a backup policy that is able to lead the controlled system from a critical st...
Journal
Journal title: IEEE Transactions on Automatic Control
Year: 2023
ISSN: 0018-9286, 1558-2523, 2334-3303
DOI: https://doi.org/10.1109/tac.2022.3152724